(57) ABSTRACT

A system for accessing documents contained in a remote 
repository, which change in content from version-to-version. 
The system allows users to specify lists of documents of 
interest. Based on the lists, the system maintains an archive, 
which contains a copy of one version of each listed 
document, and material from which the other versions can 
be reconstructed. The system periodically compares the 
archive with current versions of the documents located in the 
repository, and updates the archive, thereby maintaining the 
ability to reconstruct current versions. The system also 
monitors access to the versions by each user. When a user 
calls for a current version, the system presents the current 
version, and indicates what parts of the current version have 
not been previously accessed by the user. 
METHOD AND APPARATUS FOR TRACKING AND VIEWING CHANGES ON
THE WEB

REFERENCE TO A MICROFICHE APPENDIX

Included with and forming part of this specification is a
microfiche appendix, including 1 sheet having a total of 52
frames.

The invention concerns presentation of a current version
of a document retrieved from a data repository. The
presentation indicates changes made in the document since the
viewer accessed a previous version.

BACKGROUND OF THE INVENTION

Information which is stored in computerized systems can
change frequently, and without notice. As an example,
software under development frequently involves many
persons, and is commonly stored at a central location. Each
person can change the software on an ad hoc basis, without
knowledge of others.

In such systems containing changeable data, a person who
examines information on a given day does not, in general,
know whether, and how, the information has changed since
a previous examination. Consequently, the person must
spend time comparing currently available information with
previous versions of the information.

Software exists for facilitating this comparison. For
example, systems known as "version control systems," or
"revision control systems," store data which represents
multiple versions of different documents, as indicated in
FIG. 1A. In that Figure the DATA is indicated, together with
dashed loops which indicate the VERSIONS.

The loops indicate that the VERSIONS are contained in,
and derivable from, the DATA. For example, each VERSION
can be stored in its entirety. Alternately, a single
VERSION can be stored in its entirety, and other VERSIONs
can be stored in the form of differences between them
and the single, entire VERSION.

The version control system reconstructs any selected
VERSION for the user.

However, many such software systems suffer disadvantages.
In general, some systems notify users of the occurrences
of changes, but do not identify the changes themselves.
Conversely, other systems identify the changes
(generically, these systems are known as "diff" systems), but
only in response to identification of a particular pair of
documents.

SUMMARY OF THE INVENTION

One form of the invention observes a user's examination
of a document contained in a repository. The invention then
continually monitors that document for modifications. When
the user examines the document at a later time, the invention
presents the document in the current, later, form, and
indicates the modifications occurring since the user last viewed
the document.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1A illustrates a prior-art version control system.

FIG. 1B illustrates selected concepts involved in hypertext
information retrieval.

FIG. 1 illustrates an illustrative embodiment of the
invention.

FIG. 2 illustrates a hot list, and copying PAGEs from a
REPOSITORY to an EXTERNAL SERVICE.

FIG. 3 illustrates an ARCHIVE within the EXTERNAL
SERVICE, which contains copies of original versions of
PAGEs, and changes made to the original versions.

FIG. 3A illustrates a display, generated by the invention,
which lists various versions of a PAGE.

FIG. 3B illustrates a display, generated by the invention,
which identifies PAGEs contained in a hot list which have
changed.

FIG. 4 illustrates a current version of a PAGE, presented
in a format which points out changes made since a previous
version.

FIG. 5 illustrates hot lists for two users, as compared to a
single user, as in FIG. 2.

FIG. 6 is a flow chart illustrating logic implemented by
one form of the invention.

FIG. 7 is a time-history of three PAGEs.

FIG. 8 is an architecture for part of one type of EXTERNAL
SERVICE.

FIG. 9 illustrates one form of the invention.

FIG. 10 illustrates one form of the invention.

FIG. 11 illustrates output of HTMLDIFF, showing differences
between a subset of two versions of
HTTP://SNAPPLE.CS.WASHINGTON.EDU:600/MOBILE/.
The original HTML source was edited manually to make the
result fit onto one page; in practice, the highlighted changes
would be interspersed among a much larger document.
Small arrows point to changes, which are primarily additions
in this case; the "last update" date is an example of text
being replaced. Here, the page's author had marked the
changes manually with small icons as well. The banner at the
top of the page was inserted by HTMLDIFF.

FIG. 12 illustrates version histories which give the user a
chance to compare any two versions, or to go directly to a
selected version.

FIG. 13 illustrates output of W3NEWER, and shows a
number of anchors (the descriptive text originates from the
hot list). The anchors marked "changed" have modification
dates after the time which the user's browser history
indicates the URL was last seen. Some URLs were not checked
at all, and others were checked and are known to have been
seen by the user.

FIG. 14 demonstrates use of a SNAPSHOT facility, which
allows a user to specify an operation on a URL. In this
example, DOUGUS@RESEARCH.ATT.COM is "remembering"
URL HTTP://SNAPPLE.CS.WASHINGTON.EDU:600/MOBILE/.

DETAILED DESCRIPTION OF THE INVENTION

An illustrative embodiment of the invention is given in
the discussion below.

Overview of Invention

A commonly used repository of information is known as
the World Wide Web, or WWW. In the WWW, providers of
information make their information available to users in the
form of "pages." Each page is assigned a name, which
distinguishes the page from other pages, and allows a user
to locate the page.

The WWW provides information using an information
retrieval-and-display approach called "hypertext." In
hypertext, a page may contain references to other pages, or
other documents. A user can call up a page which is
referenced, by clicking on the reference (called a URL, or
Universal Resource Locator) with a pointing device. FIG.
1B provides an example.

In FIG. 1B, a document D is displayed to a user. References
R refer to other documents. For example, R1 refers to
D1, R2 refers to D2, and so on. The referenced documents
themselves may contain their own references to other
documents, such as R4, which refers to D4.

A user can retrieve a referenced document D, by clicking
on the reference R which refers to it. For example, clicking
on R1 causes retrieval and display of D1.

Under the invention, a user of the WWW initially identifies
pages of interest. Document D in FIG. 1B represents
one page. These selected pages form a "hot list." Then, the
invention does the following:

(a) Copies the hot-listed pages into an archive, which is a
storage location separate from the WWW, and under
independent control. After the copying, the original
pages continue to reside in the WWW, and copies
reside in the archive.

(b) Monitors, at later times, the original pages for
changes, and archives the changes.

(c) Records the times when the user later accesses each
hot-listed page.

(d) Whenever the user accesses a hot-listed page, presents
the user with

i) the current version of the page (which may differ
from the initial copy which was stored in the
archive); and

ii) an option to compare selected versions of the page.
The comparison is presented by performing a
differencing operation on pairs of versions.

(e) As an option, the invention also implements the steps
described above with respect to documents referenced
by the page. For example, in FIG. 1A, if a user is
viewing document D, the invention can present the
current version of reference document D2, together
with a history of D2.

More Detailed Description

Hot-List Pages are Stored in EXTERNAL SERVICE

FIG. 1 illustrates a REPOSITORY of information, such as
the WWW. For assistance in accessing the REPOSITORY,
the invention provides the EXTERNAL SERVICE which
includes:

(a) SOFTWARE, such as that provided in the COMPUTER
PROGRAM LISTING herein,

(b) a SERVER, or other computer, which runs the
software, and

(c) COMMUNICATION SYSTEMS which link with both
the users and the REPOSITORY.

The SERVER and the COMMUNICATION SYSTEMS
located within the EXTERNAL SERVICE are known in the
art. As indicated in the Figure, the EXTERNAL SERVICE
is distinct from the REPOSITORY, and under separate
control.

The invention does not disrupt the users' normal interaction
with the REPOSITORY; the users can interact with both
the REPOSITORY, as usual, and also with the EXTERNAL
SERVICE. Dashed arrows 3 indicate the interaction. Several
examples will provide illustrative modes of operation of the
invention.

EXAMPLE

Single User

Operation with respect to a single user will first be
explained. FIG. 2 shows a hot list 4, submitted by USER 1,
which identifies pages A and B as being of interest to USER
1. The invention allows the user to modify the hot list at later
times. In response to the hot list, the invention copies pages
A and B from the REPOSITORY, as indicated by the dashed
arrows. These PAGEs will be termed "base pages." At this
point, the originals of PAGEs A and B reside in the
REPOSITORY, and copies reside in the EXTERNAL SERVICE.

The invention then examines the originals of the PAGEs,
located in the REPOSITORY, for changes.
In looking for changes, the invention first performs a
preliminary check, based on information such as (1) dates of
modification and (2) checksums.

Dates of modification may be added to a PAGE by the
PAGE provider. These dates directly indicate whether the
originally archived version has changed.

Checksums are generated by the invention. An example of
a checksum is the numerical sum of all characters in a line,
or on a page. If a checksum changes (indicating that the
number of characters has changed), the change indicates a
high probability that a change has occurred in the PAGE. (In
practice, the checksums used are more complex than this
simple example illustrates. Checksums are known in the art.)

If the preliminary check, either by dates of modification
or checksums, indicates that changes have occurred, then the
invention copies the present version of the PAGE into the
EXTERNAL SERVICE, and compares it with the base page,
in order to locate the changes. Computer programs for
detecting such changes are known in the art, and some
examples are given in the TECHNICAL APPENDIX. A
preferred program, not known in the prior art, is entitled
W3NEWER, and was developed by the inventors.
W3NEWER is contained in the listing located at the end of
this Specification.

When changes are found, the invention stores them in the
EXTERNAL SERVICE. FIG. 3 illustrates storage of the
changes, by the small boxes 6 located below PAGEs A and
B. The DATEs within the boxes 6 indicate the dates on
which the changes were saved.

FIG. 3A illustrates how the invention displays the history
of versions. Column 7 indicates the number assigned to each
version by the invention. Column 8 indicates the times when
the respective versions were retrieved by the invention.
Column 8A allows a user to select a pair of versions for a
differencing operation, as discussed below.

For ease of explanation, FIG. 3 illustrates storage of base
pages, which are early versions of PAGEs, together with
subsequent changes, indicated by the boxes 6. However, in
practice, it can be more efficient to perform storage in a
reversed sense, by storing the latest version as the base page
(instead of the early version) and storing the changes 6 from
which early versions can be reconstructed. One reason is that
users are expected to call for latest versions more frequently
than early versions. Storage of the entire latest versions
eliminates the need to reconstruct them.

The changes, together with their base pages, form an
archive, which allows reconstruction of a PAGE as of any
date desired. For example:

PAGE A itself (i.e., the base page), plus the changes labeled
DATE 1, allow reconstruction of the version of PAGE
A, as of DATE 1.
PAGE A itself, plus the changes labeled DATE 1 and
DATE 2, allow reconstruction, as of DATE 2, and so 
on. 
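
The reconstruction just described can be sketched as follows. This is a minimal illustration, not the code of the COMPUTER PROGRAM LISTING: the names `make_delta`, `apply_delta`, and `PageArchive` are hypothetical, pages are modeled as lists of lines, dates are modeled as plain integers, and Python's standard `difflib` stands in for whatever differencing program is actually used.

```python
from difflib import SequenceMatcher


def make_delta(old_lines, new_lines):
    """Record only what is needed to turn old_lines into new_lines."""
    delta = []
    matcher = SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Meaning: replace old_lines[i1:i2] with these stored lines.
            delta.append((i1, i2, new_lines[j1:j2]))
    return delta


def apply_delta(old_lines, delta):
    """Apply a delta produced by make_delta to old_lines."""
    result = list(old_lines)
    # Work from the end so that earlier indices remain valid.
    for i1, i2, replacement in reversed(delta):
        result[i1:i2] = replacement
    return result


class PageArchive:
    """A base page plus dated changes (hypothetical layout)."""

    def __init__(self, base_lines):
        self.base = list(base_lines)
        self.latest = list(base_lines)
        self.changes = []  # (date, delta) pairs, in order of detection

    def record(self, date, current_lines):
        """Store the changes found on `date`, if there are any."""
        delta = make_delta(self.latest, current_lines)
        if delta:
            self.changes.append((date, delta))
            self.latest = list(current_lines)

    def reconstruct(self, as_of):
        """Rebuild the page as it stood on date `as_of`."""
        lines = list(self.base)
        for date, delta in self.changes:
            if date > as_of:
                break
            lines = apply_delta(lines, delta)
        return lines
```

In this sketch the base page is the earliest version and the changes run forward in time; the reversed arrangement mentioned in the text (latest version stored whole, changes leading back to earlier versions) would use the same two helper functions with the roles of old and new exchanged.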

When USER 1 wishes to view PAGE A, the invention 
ordinarily retrieves and presents the current version. The 
invention also provides an option for reconstructing the 
PAGE, as of a date specified by the user, and presents it in 
the format shown in FIG. 4. The program HTMLDIFF, contained
in the listing, generates the image shown in FIG. 4.
The content of the page can be divided into three classes. 

The first class contains material which has not changed. 
This class of material is displayed in the font, size, color, and 
background, as customary in documents downloaded from 
the REPOSITORY.

The second class represents changes, and contains mate- 
rial not present in the base page, but which has been added. 
Brackets 9 indicate such material. (The brackets 9 are part 
of FIG. 4, and are not necessarily part of the page generated 
by the invention.) This material is presented in a particular 
font, particular size, particular color, and particular back- 
ground. The choice of these parameters can be varied but, in 
general, they should be chosen to maximize contrast with the 
first class of material. In addition to the formatting described 
immediately above, the added material is further highlighted 
by arrows 7. 

The third class contains material which was deleted from 
the base page. Deleted material can be handled in at least 
three ways. One, deleted material can be simply deleted, so 
that the page presented to the reader contains no reference to 
the deleted material. 

Two, the deleted material can be deleted, but a reference
indicating the deletion is added, such as the phrase "Deleted
material occurs here." In this case, the user can be given the
option of fetching the deleted, non-visible, material.

Three, deleted material can be presented, but indicated as 
deleted, as by "redline" format, in which a horizontal line, 
perhaps red in color, is drawn through the deleted material. 
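
The three classes of material, and the three ways of handling deletions, can be illustrated with a short sketch. It is not HTMLDIFF: the function name `mark_changes`, the mode names "omit", "note", and "strike", and the particular HTML tags chosen for highlighting are assumptions made for illustration, and Python's `difflib` is used in place of the differencing program of the listing.

```python
from difflib import SequenceMatcher
from html import escape


def mark_changes(old_sentences, new_sentences, deleted="strike"):
    """Return HTML fragments for the new version with changes marked.

    deleted="omit"   : deleted material is simply dropped
    deleted="note"   : a placeholder marks where material was deleted
    deleted="strike" : deleted material is shown struck through
    """
    out = []
    matcher = SequenceMatcher(a=old_sentences, b=new_sentences,
                              autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # First class: unchanged material, shown as usual.
            out.extend(escape(s) for s in new_sentences[j1:j2])
            continue
        if tag in ("delete", "replace"):
            # Third class: material deleted from the older version.
            if deleted == "note":
                out.append("<em>[Deleted material occurs here]</em>")
            elif deleted == "strike":
                out.extend("<strike>%s</strike>" % escape(s)
                           for s in old_sentences[i1:i2])
        if tag in ("insert", "replace"):
            # Second class: added material, shown with high contrast.
            out.extend("<strong><em>%s</em></strong>" % escape(s)
                       for s in new_sentences[j1:j2])
    return out
```

A replaced sentence is treated here as a deletion followed by an addition, so it appears in both the second and third classes.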

FIG. 3B illustrates a display, generated by the invention, 
which indicates which PAGEs on a user's hot list have 
undergone changes. 

SECOND EXAMPLE 

Multiple Users 

In actual practice, multiple users are expected to use the 
invention. Each of them submits a hot list. In one approach 
of the invention, the procedure undertaken for a single user 
(described above) is repeated for multiple users: all PAGEs, 
on all hot lists, are copied into the EXTERNAL SERVICE. 
Then, for each hot list, the originals of the PAGES, located 
within the REPOSITORY, are monitored for changes, and 
the changes are retrieved into the EXTERNAL SERVICE, as 
described above. 

However, this approach contains inefficiencies. For 
example, a given PAGE will probably be identified by more 
than one hot list. Repeatedly copying that PAGE, for each 
hot list, would entail storage of multiple copies of the same 
PAGE. Further, repeatedly comparing the multiple copies 
with their originals in the REPOSITORY represents a waste 
of computer time: a single comparison would suffice. The 
invention reduces these inefficiencies by the approach 
shown in FIG. 5. 

This Figure represents a modification of FIG. 4, to which
a hot list for USER 2 has been added. The added hot list
specifies PAGEs A and C.

To process the new hot list, the invention first checks 
whether the PAGEs identified on the added hot list are 
archived within the EXTERNAL SERVICE. Since PAGE A,
plus its changes, are already contained within the archive, 
that PAGE is not copied. But PAGE C, which is not present 
in the ARCHIVE, is archived, as indicated by the dashed 
arrow. 

At this time, all PAGEs identified on all hot lists are 
contained within the archive. To emphasize this fact, PAGE 
A is indicated twice: once for USER 1, and a second time by 
a dashed page 14, for USER 2, although, as stated above, 
PAGE A is stored only once. 

After archiving all necessary PAGEs, the originals, 
located within the REPOSITORY, are periodically moni- 
tored for changes, as described above. The changes are 
copied to the archive of the EXTERNAL SERVICE. 
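
The single-copy approach for multiple hot lists can be sketched as below. The class name `ExternalService`, its methods, and the `fetch` callable (which stands for whatever mechanism copies a PAGE from the REPOSITORY) are hypothetical names introduced only for this illustration.

```python
class ExternalService:
    """Keeps one archived copy per PAGE, however many hot lists name it."""

    def __init__(self, fetch):
        self.fetch = fetch      # callable: page name -> current content
        self.archive = {}       # page name -> archived base page
        self.hot_lists = {}     # user -> set of page names

    def submit_hot_list(self, user, pages):
        """Accept a hot list and archive only PAGEs not already held."""
        self.hot_lists[user] = set(pages)
        for page in pages:
            if page not in self.archive:
                self.archive[page] = self.fetch(page)

    def pages_to_monitor(self):
        """Each PAGE appears once, so one comparison per PAGE suffices."""
        return set().union(*self.hot_lists.values())
```

With the hot lists of the example (USER 1: PAGEs A and B; USER 2: PAGEs A and C), PAGE A is fetched once, and the set of PAGEs to monitor is A, B, and C.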

Flow Chart 

An exemplary flow chart is shown in FIG. 6, which refers 
to a single-user case. In block 20, the EXTERNAL SER- 
VICE accepts hot lists from users. Then, in block 23, the 
EXTERNAL SERVICE checks whether the PAGEs identi- 
fied on the hot lists are contained within the archive. If not, 
the PAGEs are copied from the REPOSITORY, as indicated 
by block 26. 

Then the logic proceeds to block 29, where the originals 
of the PAGEs, located in the REPOSITORY, are examined 
for changes. The examination can include the preliminary 
checks (for checksums and dates of modification) discussed 
above. When changes are found, the entire PAGE containing 
them is downloaded to the EXTERNAL SERVICE, and the 
changes, indicated by blocks 6 in FIG. 3, are derived. Block 
32 indicates relevant information stored in the EXTERNAL 
SERVICE. 
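
The preliminary checks can be sketched as follows. The function names and the layout of the `archived` record are assumptions made for illustration. `simple_checksum` is the character-sum example given earlier in the text; `zlib.crc32` is used only as a stand-in for the "more complex" checksums the text alludes to, since the specification does not name one.

```python
import zlib


def simple_checksum(text):
    """The simple example from the text: the sum of all characters."""
    return sum(ord(ch) for ch in text)


def crc_checksum(text):
    """A stronger stand-in checksum (an assumption, not from the text)."""
    return zlib.crc32(text.encode("utf-8"))


def may_have_changed(archived, current_date=None, current_text=None):
    """Preliminary check before the full comparison is attempted.

    archived: dict with 'modified' and 'checksum' entries
              (hypothetical record layout).
    """
    if current_date is not None and archived.get("modified") is not None:
        # A date of modification directly indicates a change.
        return current_date > archived["modified"]
    if current_text is not None:
        # Otherwise fall back to comparing checksums.
        return crc_checksum(current_text) != archived["checksum"]
    # Nothing to compare against: assume changed, then fetch and compare.
    return True
```

The character-sum checksum cannot detect two characters exchanging places, because the sum is unaffected by order; this is one reason a practical system would prefer a stronger checksum.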

As users access the PAGEs, block 35 monitors the times 
of the accesses, in order to identify which versions of each 
PAGE the user viewed last. These times are stored, as 
indicated by block 32 and dashed arrow 37. These times are 
used to determine which changes in FIG. 4 are to be 
identified as new material, when a PAGE is called by each 
user. An example will illustrate. 

FIG. 7, top, illustrates the time-history of changes made 
to PAGE A. USER 1 accessed this PAGE at time 2, as 
indicated. Block 35 in FIG. 6 monitors and records this time 
(at TIME 2 in FIG. 7, and not earlier, of course). 

If USER 1 again accesses the PAGE at time 5, then the 
invention presents VERSION 1 to the USER. However, if 
the user accesses the PAGE at time 11, VERSION 2 had 
been created since the last access by USER 1. The invention 
had previously identified the changes, and copied them as 
indicated in FIG. 3. Now, at the access at time 11, the 
invention presents VERSION 1, plus the changes which 
make VERSION 2, because block 35 in FIG. 6 indicates that 
the USER has not seen VERSION 2. 

Returning to the flow chart of FIG. 6, block 39 indicates 
that, when a USER calls for a PAGE, the invention presents 
the current version, and indicates the changes made (as in 
FIG. 4) since the USER last accessed that page. In the 
example immediately above, the invention presents VER- 
SION 2 of PAGE A, as in FIG. 7, and indicates the changes 
made since VERSION 1, because VERSION 1 was the last 
accessed by USER 1. 
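
The use of access times to decide which changes are new to a given user can be sketched as below. The names `version_at` and `AccessLog` are hypothetical, times are modeled as plain numbers, and the change times used in the usage note are assumed values chosen to match the example.

```python
from bisect import bisect_right


def version_at(change_times, t):
    """Number of the version current at time t (VERSION 1 is the base).

    change_times: sorted times at which changes were archived.
    """
    return 1 + bisect_right(change_times, t)


class AccessLog:
    """Records, per user and PAGE, the time of the most recent access."""

    def __init__(self):
        self.last_access = {}   # (user, page) -> time of last access

    def access(self, user, page, now, change_times):
        """Return (version last seen, current version) and log the access."""
        previous = self.last_access.get((user, page))
        self.last_access[(user, page)] = now
        current = version_at(change_times, now)
        if previous is None:
            # First access: nothing has been seen, so nothing is marked.
            return current, current
        return version_at(change_times, previous), current
```

Assuming PAGE A changed at times 7 and 13, accesses by USER 1 at times 2 and 5 both yield VERSION 1 with nothing to highlight, while an access at time 11 yields the pair (1, 2): the current VERSION 2 is presented, with the changes since VERSION 1 marked.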

The flow chart of FIG. 6 should not be read as limiting the 
invention to a linear, sequential mode of operation. In 
practice, multiple users can present hot lists simultaneously, 
and other operations shown in the flow chart can also occur 
together. 
THIRD EXAMPLE

Notification of Changes

The invention can notify USERs when changes in their
hot-listed PAGEs occur, as indicated by the dashed block 40
in FIG. 6. This notification can take the form of a flag which
is associated with the BASE PAGE in FIG. 8. When the
USER logs into the EXTERNAL SERVICE, the invention
notifies the USER of the changes to the respective PAGEs.
FIG. 3B illustrates one approach to identifying PAGEs
which have changed.

Other types of notification are possible. For example, the
invention need not wait for a user to access a PAGE. The
invention can notify the user when changes have been found,
as by sending an electronic mail message to the user.

FOURTH EXAMPLE

Common Hot List

The invention can maintain a predetermined hot list, for
a community of users. This hot list contains a list of PAGEs
which are considered to be of general interest to the
community. This hot list, and the PAGEs identified on it, are
made publicly available, to all users, but on a read-only
basis. Users cannot modify the hot list, or the pages.

This predetermined hot list can serve as an instructional
tool, to educate users in the operation of the invention, and
to demonstrate desirable features.

One Architecture of Data Storage

An illustrative approach to storage of the information
identified in block 32 of the flow chart of FIG. 6 is illustrated
in FIG. 8, which is explained with reference to FIG. 7.

FIG. 7 illustrates hypothetical changes to the three PAGEs
identified by the two hot lists of FIG. 5. PAGE A underwent
changes at times 7 and 13. PAGE B underwent changes at
time 10, and so on.

In FIG. 8, the arrows extending from the symbols "USER
1", etc., indicate the times of access by the users. For
example, USER 1 accessed PAGE A, VERSION 1, at time
2. USER 1 then accessed PAGE A, VERSION 2, at time 9,
and so on.

The invention maintains a TABLE of these times, as
indicated on the right side of FIG. 8, together with a list of
PAGEs, or documents, owned by each USER. Ownership is
determined by the hot lists. The invention also maintains (a)
the BASE PAGES, (b) the changes to each, and (c) the times
of each change, as indicated on the left side of the Figure.
From this data, the invention is able to reconstruct any
PAGE, as of any date subsequent to the date of the BASE
PAGE.

Additional Considerations

1. One definition of "page" is that it refers to a unit of data,
stored in a system, which is identified by a specific name. (In
the WWW, all pages have unique names.) Other terms can
refer to such units of data, such as "files" and "documents."
In general, the particular name used will depend on the
system storing the data.

2. One definition of "repository" is a collection of data,
which is accessible by computer. The repository may be
available to the public, or access may be limited. In general,
repositories are expected to be distributed, meaning that the
storage locations are physically distributed over a wide
geographic area, and linked together by a communication
system.

3. It was stated above that the invention can reconstruct a
page as of any selected date. The reconstruction is based on
the changes 6 in FIG. 3. These changes are detected
periodically, and the periodicity is determined by each user
of the system, subject to limits imposed by the designer and
system administrator.

For example, user A can specify a period of one day for
checking for changes in the pages on user A's hot list; user
B can specify a different period for B's pages, such as one
week. The system administrator can specify that no period,
for any user, can be shorter than one hour.

Consequently, changes in a page, located in the
REPOSITORY, will only appear in a reconstruction done by
the EXTERNAL SERVICE after the changes have been
detected, and not earlier. An example will illustrate this
distinction.

Assume that the invention looks for changes on odd- 
numbered dates. Thus, a change occurring on the fourth of 
a month will be detected on the fifth. However, if a user 
happens to call for reconstruction on the fourth, the change 
occurring on the fourth will not appear in the reconstruction. 
Only changes occurring as of the prior detection, namely, as 
of the third, will appear. 

It is expected that the detection process will be performed 
sufficiently often that the influence of this factor will be 
negligible. 
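
The effect of periodic detection on reconstruction can be sketched as follows. All names are hypothetical, days are modeled as positive integers, and the sketch makes the simplifying assumption that a change made on a detection day is detected on that same day.

```python
def effective_period(requested_hours, minimum_hours=1):
    """A user's checking period, subject to the administrator's minimum."""
    return max(requested_hours, minimum_hours)


def last_detection(day, is_check_day):
    """Most recent day, on or before `day`, on which changes were sought.

    Returns 0 if no detection has yet taken place.
    """
    while day > 0 and not is_check_day(day):
        day -= 1
    return day


def visible_changes(change_days, query_day, is_check_day):
    """Changes that can appear in a reconstruction requested on query_day."""
    cutoff = last_detection(query_day, is_check_day)
    # Simplification: a change made on a detection day counts as detected.
    return [d for d in change_days if d <= cutoff]
```

With detection on odd-numbered dates, a change made on the fourth is absent from a reconstruction requested on the fourth (the prior detection was on the third) and present in one requested on the fifth.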

4. The invention can extend its differencing function (i.e.,
the examination of pages for changes) to pages referenced
by the page accessed by the user. For example, if the user
accesses document D in FIG. 1B, the invention can detect
changes in all documents referenced by document D, such as
D1, D2, and D3.

In another embodiment, the differencing can extend to the
documents which are, in turn, referenced by the referenced
documents. For example, the referenced documents (D1,
D2, and D3) refer to D5 and D6. These latter documents (D5
and D6) can be differenced also, as can be the documents 
which they reference, and so on. 
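
Extending the examination to referenced documents amounts to following references outward from the accessed page to a chosen depth. The sketch below is illustrative only: `LinkCollector`, `referenced_pages`, and the `get_html` callable are hypothetical names, references are assumed to be HTML anchor tags, and the value of each `href` is assumed to serve directly as the name of the referenced page.

```python
from collections import deque
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href targets of anchor tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def referenced_pages(start, get_html, max_depth=1):
    """Pages reachable from `start` within max_depth references.

    get_html: callable mapping a page name to its HTML text.
    Pages are returned in breadth-first order, each listed once.
    """
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        name, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow references beyond the chosen depth
        parser = LinkCollector()
        parser.feed(get_html(name))
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                order.append(link)
                queue.append((link, depth + 1))
    return order
```

A depth of 1 corresponds to the first embodiment (documents referenced by D); a depth of 2 corresponds to the second (documents referenced by those documents). Each page returned would then be checked for changes in the same manner as a hot-listed page.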

5. The invention provides all information from which a 
current version of a PAGE may be derived. FIG. 4 gives an 
example. FIG. 4 contains all such information, together with 
other information which indicates changes since a previous 
version. 

6. The discussion above presumed that comparison, or 
differencing, between different versions of a PAGE was 
done within the EXTERNAL SERVICE. This is not strictly 
necessary; the comparison can be done at any convenient 
location. Further, the preliminary checking for the existence 
of changes can be done at any convenient location. 

7. In data storage systems, names are given to the units of 
information (e.g., documents, pages, records), although the 
names can be different in different databases. However, the 
names of the units, in general, remain the same throughout 
time, despite changes which are made to the information 
contained in the unit. Therefore, one definition of the term 
"version" refers to a unit of information, which is different 
from a previous unit of the same name. 

8. The REPOSITORY in FIG. 1 is, in general, located 
remotely from the EXTERNAL SERVICE. Communication 
is undertaken by any convenient approach, such as a public- 
access communication network known as the INTERNET. 

In general, the REPOSITORY is under independent con- 
trol of the EXTERNAL SERVICE. One ramification of this 
independent control is that the type of processing done to the 
PAGEs copied into the EXTERNAL SERVICE is controlled 
by the EXTERNAL SERVICE, and not by the REPOSITORY.
For example, (a) the particular processes used in
locating and storing differences, (b) the frequency of
processing, and (c) the mode of notifying a user, are
controlled by the designer of the EXTERNAL SERVICE. The
operator of the REPOSITORY has no involvement in this
processing.

9. FIG. 9 illustrates another form of the invention. The
invention maintains base pages 30 within the EXTERNAL
SERVICE, as required by the hot lists 36. The base pages 30
were downloaded from respective repositories 42A, 42B,
etc.

The invention periodically monitors the originals 30A of
the pages, located in the repository 42, for changes, and
stores the changes within the EXTERNAL SERVICE. The
invention notifies users when changes are found in pages on
their hot lists (notification is not shown).

A version control system 39 allows users to fetch and
view any version of any page.

10. The different versions of documents may contain
drawings, files from which sound may be generated, files
which produce video clips and animation, and other
components which do not consist strictly of alphanumeric
characters. The invention detects the existence of changes in
such components, and marks the existence of the changes, in
the display as shown in FIG. 4, without necessarily
identifying in detail the nature of the changes.

11. A primary use of the invention is envisioned in the
situation shown in FIG. 10. The EXTERNAL SERVICE
obtains copies of PAGEs from a REPOSITORY, such as
WWW. However, the EXTERNAL SERVICE is given no
authority to replace or modify the pages contained in the
REPOSITORY. To the EXTERNAL SERVICE, the PAGEs
represent read-only data, as indicated by the "X" over arrow
50, which indicates a write operation.

The EXTERNAL SERVICE performs differencing
between currently copied versions of pages, and DATA
representing previous versions. The DATA stored in the
EXTERNAL SERVICE can be both read, and written to, by
the EXTERNAL SERVICE. The EXTERNAL SERVICE
reconstructs any version on demand, and also indicates
differences between any two versions selected by a user, as
discussed above. These functions can be accomplished by a
prior-art Revision Control System, RCS (also called a
Version Control System), or by the code contained in the listing
contained in this Specification.

12. In one form of the invention, the PAGEs retrieved are
written in a "markup language" such as HyperText Mark-up
Language (HTML). A mark-up language, in general,
contains two types of codes, interspersed among the actual text
of a document.

One type indicates how the PAGEs are to be displayed.
For example, some codes indicate paragraph indentation,
other codes indicate font styles, yet other codes indicate
style of font, within a font, such as italicizing, underlining,
double-striking, or bold printing. This type of code is
referred to as format-defining.

A second type of code can identify an image, such as a
bit-mapped file located elsewhere. When such a code is read
by the system displaying the PAGE, a copy of the image is
retrieved, and displayed within the PAGE, at the location
specified by the code. This type of code is referred to as
content-defining.

The invention does not treat changes in the
format-defining codes as changes in content. Thus, a PAGE which
changes in layout, or typestyle, only, is not designated as a
changed page.

The differencing program contained in the COMPUTER
PROGRAM LISTING compares different versions on a
subunit-by-subunit basis. For example, the program
compares corresponding sentences in different versions, and the
sentences are detected by sentence terminators. (Longer
subunits can be used, such as paragraphs or pages.) The
sentence terminators are a subset of the markup language.
Specifically, the terminators are format-defining codes.

COMPUTER PROGRAM LISTING

The program listing is divided into three sections:

1. htmldiff, comprising:
- htmldiff.sml (5 pages),
- [file name illegible] (3 pages),
- [file name illegible].sml (4 pages), and
- html.lex (one page).

2. W3NEWER (17 pages),

3. nohands, comprising:
- nohandsBE (11 pages),
- no-hands.cgi (3 pages),
- rcsdiff.cgo (4 pages), and
- snapshot[extension illegible] (3 pages).

NOHANDS is an overall program set which utilizes W3NEWER and
htmldiff.

A set of tools that detect when World-Wide-Web pages
have been modified and present the modifications visually to
the user through marked-up HTML. The tools consist of
three components: w3newer, which detects changes to
pages; snapshot, which permits a user to store a copy of an
arbitrary Web page and to compare any subsequent version
of a page with the saved version; and htmldiff, which marks
up HTML text to indicate how it has changed from a
previous version. The tools are referred to collectively as the
Network-Oriented HTML Archival, Notification, and
Differencing System (NO HANDS). Presented are several aspects
of NO HANDS, with an emphasis on systems issues such as
scalability, security, and error conditions.

Use of the World-Wide-Web (W3) has increased
dramatically over the past couple of years, both in the volume of
traffic and the variety of users and content providers. The W3
has become an information distribution medium for
academic environments (its original motivation), commercial
ones, and virtual communities of people who share interests
in a wide variety of topics. Information that used to be sent
out over electronic mail or USENET, both active media that
go to users who have subscribed to mailing lists or
newsgroups, can now be posted on a W3 page. Users
interested in that data then visit the page to get the new
information.

The URLs of pages of interest to a user can be saved in
a "hotlist" (known as a bookmark file in Netscape™), so
they can be visited conveniently. How does a user find out
when pages have changed? If users know that pages contain
up-to-the-minute data (such as stock quotes), or are
frequently changed by their owners, they may visit the pages
often. Other pages may be ignored, or browsed by the user
only to find they have not changed.

In recent months, several tools have become available to
address the problem of determining when a page has
changed. One example of such a tool is webwatch, a
product for Windows™ that uses the HTTP HEAD command
to find out when a page has been modified since it was
last viewed by a user's web browser, and generates a report
in HTML that allows the user to go directly to those updated
pages. Another example is w3new, by Brooks Cutter, a
public-domain perl script that runs on UNIX® as shown in
"B. B. Cutter III. w3new. http://www.stuff.com/bcutter/
programs/w3new/w3new.html".

Each of these tools suffers from a significant deficiency: 
while they provide the user with the knowledge that the page 
has changed, they do not show how the page has changed. 
Although a few pages are edited by their maintainers to 
highlight the most recent changes, often the modifications 
are not prominent, especially if the pages are large. Even 
pages with special highlighting of recent changes are prob- 
lematic: if a user visits a page frequently, what is "new" to 
the maintainer may not be "new" to the user. Alternatively, 
a user who visits a page infrequently may miss changes that 
the maintainer deems to be old. 

A system has been developed that efficiently tracks when 
pages change, compactly stores versions on a per-user basis, 
and automatically compares and presents the differences 
between pages. NO HANDS (Network-Oriented HTML
Archival, Notification, and Differencing System) provides
"personalized" views of versions of W3 pages with three
tools. The first, w3newer, is a more scalable version of
Cutter's w3new modification tracking tool that periodically
accesses the W3 to find when pages on a user's hotlist have
changed. The second, snapshot, allows a user to save
versions of a page and later use a third tool, htmldiff, to
see how it has changed. Htmldiff automatically compares two
HTML pages and creates a "merged" page to show the
differences with special HTML markups.

While NO HANDS can help arbitrary users track pages of 
interest, it can be of particular use in a collaborative envi- 
ronment. Consider a software development project that is 
geographically distributed across several locations. The W 3 
can be used to collect requirements, meeting notes, code, 
documentation, bug reports, and so on, so that everyone 
involved with the project has a consistent and up-to-date 
view of the project. As documents change, each project 
member will want to know what's "new" in their world, 
without having to waste time browsing documents. The 
w3newer component of NO HANDS automatically provides 
this information. Furthermore, what is "new" to one project 
member will be "old" to another, so that the notion of a 
document version must be "personalized" rather than global 
to the entire project. NO HANDS supports personalized 
versioning of documents with snapshot and uses htmldiff to 
provide a personalized version of "what's new" in a docu- 
ment. 

There has been a great deal of interest lately in finding out
when pages on the W3 have changed. Discussed below are
related work, issues of scalability and cache consistency, and
the handling of possible error conditions.

Two tools, webwatch for Windows and w3new for UNIX,
were mentioned above. Another similar tool is shown in "M.
Newbery. Katipo. http://www.vuw.ac.nz./newbery/Katipo.html",
which runs on the Macintosh™, and yet another is
URL-minder, as shown in "Url-minder,
http://www.netmind.com/URL-minder/URL-minder.html", which
runs as a service on the W3 itself and sends email when a
page changes. Those that run on the user's host use the
"hotlist" to determine which URLs to check, while
URL-minder acts on URLs provided explicitly by a user via an
HTML form.

There are two basic strategies for deciding when a page
has changed. Most tools use the HTTP HEAD command to
retrieve the Last-Modified field from a W3 document, either
returning a sorted list of all modification times or just those
times that are different from the browser's history (the
timestamp of the version the user presumably last saw).
URL-minder uses a checksum of the content of a page, so it
can detect changes in pages that do not provide a Last- 
Modified date, such as output from Common Gateway 
Interface (CGI) scripts. W3new (and therefore w3newer) 
requests the Last-Modified date if available; otherwise, it 
retrieves and checksums the whole page. Changes are gen- 
erally reported to the user in the form of an HTML page with 
links to each of the pages being tracked, although it can also 
be done via email as with URL-minder. 
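
The combined strategy attributed above to w3new and w3newer (prefer the Last-Modified date, otherwise checksum the whole page) can be sketched as follows. This is a minimal illustration, not the code of any of the tools named: the function name `change_token`, the use of MD5, and the idea of passing the page retrieval in as a callable are all assumptions made so the example runs without a network.

```python
import hashlib
from typing import Callable, Mapping


def change_token(headers: Mapping[str, str],
                 fetch_body: Callable[[], bytes]) -> str:
    """Return a string that differs whenever the page is seen to change.

    headers    -- response headers from an HTTP HEAD request
    fetch_body -- hypothetical callable that retrieves the full page;
                  it is only invoked when no Last-Modified date exists
    """
    last_modified = headers.get("Last-Modified")
    if last_modified:
        # Cheap case: the server reports a modification date.
        return "date:" + last_modified
    # Fallback (e.g. CGI output): retrieve and checksum the whole page.
    return "sum:" + hashlib.md5(fetch_body()).hexdigest()


# A page is reported as changed when its token differs from the stored one.
old = change_token({}, lambda: b"<P>version one</P>")
new = change_token({}, lambda: b"<P>version two</P>")
print(old != new)
```

Comparing tokens rather than dates keeps the two cases uniform: the caller stores one string per URL and never needs to know which strategy produced it.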

These tools also vary with respect to frequency of check- 
ing and where the checks are performed. Most of the tools 
automatically run periodically from the user's machine. All 
URLs are checked each time the tools run, with the possible 
exception of URL-minder, which runs on an Internet server 
and checks pages with an arbitrary frequency that is guar- 
anteed to be at least as often as some threshold, such as a 
week (URL-minder's implementation is hidden behind a CGI
interface).

The tools described above poll every URL with the same
frequency. W3new was modified to make it more
scalable, as well as to integrate it with the other components
of NO HANDS. W3newer runs on the user's machine, but
it omits checks of pages already known to be modified since
the user last saw the page, and pages that have been viewed
by the user within some threshold. The time when the user
has viewed the page comes from the W3 browser's history.
The "known modification date" comes from a variety of
sources:

a cached modification date from previous runs of 
w3newer; 

a modification date stored in a proxy-caching server's 
cache; or 

the HEAD information provided by httpd (the HTTP 
server) for the URL. 

If either of the first two sources of the modification date
indicates that the page has not been visited since it was
modified, then HTTP is used only if the time the modification
information was obtained was long enough ago to be
considered "stale" (currently, the threshold is one week).
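
The skipping rules described so far can be summarized in a short decision function. This is a sketch under stated assumptions rather than w3newer's actual logic: the function and parameter names are hypothetical, the one-week staleness value is taken from the text, and the per-page threshold is simply passed in by the caller.

```python
from datetime import datetime, timedelta
from typing import Optional

STALE_AFTER = timedelta(weeks=1)  # "currently, the threshold is one week"


def needs_http_check(now: datetime,
                     last_viewed: Optional[datetime],
                     known_modified: Optional[datetime],
                     info_obtained: Optional[datetime],
                     threshold: timedelta) -> bool:
    """Decide whether a HEAD request should be issued for one URL.

    last_viewed    -- when the user last saw the page (browser history)
    known_modified -- modification date cached locally or by a proxy
    info_obtained  -- when that cached modification date was obtained
    threshold      -- minimum interval between checks for this page
    """
    if known_modified is not None and info_obtained is not None:
        if last_viewed is None or known_modified > last_viewed:
            # Already known to be modified since the user's last visit:
            # ask the server again only if the cached information is stale.
            return now - info_obtained >= STALE_AFTER
    if last_viewed is not None and now - last_viewed < threshold:
        # The user viewed the page recently enough; skip the check.
        return False
    return True
```

For example, a page viewed on 1 January whose cached modification date of 5 January was obtained yesterday is not re-checked, since the report can already list it as changed.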

In addition, there is a threshold associated with each page 
to determine the maximum frequency of direct HEAD 
requests. If the page was visited within the threshold, or the 
modification date obtained from the proxy-caching server is 
current with respect to the threshold, the page is not checked. 
The threshold can vary depending on the URL, with perl 
pattern matching used to determine what threshold to apply. 
The first matching pattern is used. Table 1 gives an example
of a w3newer_thresholds configuration file. Thresholds
are specified as combinations of days (d) and hours (h), with
0 indicating that a page should be checked on every run of
w3newer and "never" indicating that it should never be
checked.

TABLE 1

An example of the thresholds specified to w3newer.

# Comments start with a sharp sign.
# perl syntax requires that . be escaped
# Default is equivalent to ending the file with ".*"
Default                                                                 2d
file:.*                                                                 0
http://www\.yahoo\.com/.*                                               7d
http://www\.research\.att\.com/.*                                       0
http://.*\.att\.com/.*                                                  1h
http://home\.mcom\.com/home/whatsnew/whats_new\.html                    12h
http://www\.ncsa\.uiuc\.edu/SDG/Software/Mosaic/Docs/whats-new\.html    12h
http://snapple\.cs\.washington\.edu:600/mobile/                         1d
# rarely modified
http://www\.cs\.duke\.edu/~pk/HomePage\.html                            7d
# this is in my hotlist but will be different every day
http://www\.unitedmedia\.com/comics/dilbert/                            never
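
A configuration file of the kind shown in Table 1 could be read and applied along the following lines. This is a hedged sketch, not the perl code of w3newer: the function names are hypothetical, Python's `re` module stands in for perl pattern matching, patterns are assumed to be anchored at the start of the URL, and `None` is used here to represent "never".

```python
import re
from datetime import timedelta


def parse_threshold(spec):
    """Convert '2d', '12h', '1d12h' or '0' to a timedelta; 'never' to None."""
    if spec == "never":
        return None
    if spec == "0":
        return timedelta(0)
    match = re.fullmatch(r"(?:(\d+)d)?(?:(\d+)h)?", spec)
    if not spec or match is None:
        raise ValueError("bad threshold: %r" % spec)
    days, hours = match.groups()
    return timedelta(days=int(days or 0), hours=int(hours or 0))


def parse_config(text):
    """Return an ordered list of (compiled pattern, threshold) rules."""
    rules, default_spec = [], None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        pattern, spec = line.rsplit(None, 1)
        if pattern == "Default":
            default_spec = spec
        else:
            rules.append((re.compile(pattern), parse_threshold(spec)))
    if default_spec is not None:
        # "Default is equivalent to ending the file with '.*'"
        rules.append((re.compile(".*"), parse_threshold(default_spec)))
    return rules


def threshold_for(url, rules):
    """Return the threshold of the first rule whose pattern matches."""
    for pattern, threshold in rules:
        if pattern.match(url):
            return threshold
    raise LookupError("no rule matches " + url)


EXAMPLE = r"""
# Comments start with a sharp sign.
Default 2d
file:.* 0
http://www\.research\.att\.com/.* 0
http://.*\.att\.com/.* 1h
http://www\.unitedmedia\.com/comics/dilbert/ never
"""

rules = parse_config(EXAMPLE)
print(threshold_for("http://www.att.com/news.html", rules))
```

Because the first matching pattern wins, the more specific `research.att.com` rule must precede the general `att.com` rule, exactly as in Table 1.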



Determining when HTTP pages have changed is analo- 
gous to caching a file in a distributed file system and 
determining when the file has been modified. While file 
systems such as the Andrew File System in "J. Howard et al. 
Scale and performance in a distributed file system. ACM 
Transactions on Computer Systems, 6(1):51-81, February
1988"; and Sprite in "M. Nelson, B. Welch, and J. Ouster-
hout. Caching in the Sprite network file system. ACM
Transactions on Computer Systems, 6(1):134-154, February
1988" provide guarantees of cache consistency by issuing 
call-backs to hosts with invalid copies, HTTP access is 
closer to the traditional NFS approach as shown in "R. 
Sandberg, D. Goldberg, S. Kleiman, D. Walsh, and B. Lyon. 
Design and implementation of the Sun network filesystem. 
In Proceedings of the USENIX 1985 Summer Conference, 
pages 119-130, June 1985", in which clients check back 
with servers periodically for each file they access. Netscape 
can be configured to check the modification date of a cached 
page each time it is visited, once each session, or not at all. 
Caching servers check when a client forces a full reload, or 
after a time-to-live value expires. 

Here the problem is complicated by the target environment:
one wishes to know not only when a currently viewed
page has changed, but also when a page that has not been
seen in a while has changed. Fortunately, unlike with file
systems, HTTP data can usually tolerate some inconsistency.
In the case of pages that are of interest to a user but have not
been seen recently, finding out within some reasonable
period of time, such as a day or a week, will usually suffice.
Even if servers had a mechanism to notify all interested
parties when a page has changed, immediate notification
might not be worth the overhead.

Instead, one could envision using something like the
Harvest replication and caching services as shown in "C. Mic
Bowman et al. Harvest: A scalable, customizable discovery
and access system. Technical Report CU-CS-732-94, Dept.
of Computer Science, University of Colorado, Boulder,
March 1995", to notify interested parties in a lazy fashion.
A user who expresses an interest in a page, or a browser that
is currently caching a page, could register an interest in the
page with its local caching service. The caching service
would in turn register an interest with an Internet-wide,
distributed service that would make a best effort to notify the
caching service of changes in a timely fashion. (This service
could potentially archive versions of HTTP pages as well.)
Pages would already be replicated, with server load 
distributed, and the mechanism for discovering when a page 
changes could be left to a negotiation between the distrib- 
uted repository and the content provider: either the content 
provider notifies the repository of changes, or the repository 
polls it periodically. Either way, there would not be a large 
number of clients polling each interesting HTTP server. 
Moving intelligence about HTTP caching to the server has
been proposed by James S. Gwertzman and Margo Seltzer in
"The case for geographical push-caching. In Proceedings of
the Fifth Workshop on Hot Topics in Operating Systems
(HotOS-V), pages 51-55, Orcas Island, Wash., May 1995.
IEEE" and others.

One could also envision integrating the functionality of
NO HANDS into file systems. Tools that can take actions
when arbitrary files change are not widely available, though
they do exist as in "Sun Microsystems. The HotJava Browser:
A White Paper. Available as
http://java.sun.com/1.0alpha3/doc/overview/hotjava/browser.whitepapers.ps".
Users might like to have a unified report of new files and W3
pages, and w3newer supports the "file:" specification and
can find out if a local file has changed. However, snapshot
has no way to access a file on the user's (remote) file system.

Moving functionality into the browser would allow
individual users to take snapshots of files that are not already
under the control of a versioning system such as the
Revision Control System (RCS) as shown in "W. Tichy. RCS: a
system for version control. Software-Practice & Experience,
15(7):637-654, July 1985"; this might be an appropriate use
of a browser with client-side execution, such as HotJava in
"Sun Microsystems. The HotJava Browser: A White Paper.
Available as
http://java.sun.com/1.0alpha3/doc/overview/hotjava/browser.whitepapers.ps".

When a periodic task checks the status of a large number
of URLs, a number of things can go wrong. Local problems
such as network connectivity or the status of a proxy-caching
server can cause all HTTP requests to fail. Proxy-caching
servers are sometimes overloaded to the point of
timing out large numbers of requests, and a background task
that retrieves many URLs in a short time can aggravate their
condition. W3newer should therefore be able to detect cases
when it should abort and try again later (preferably in time
for the user to see an updated report).

At the same time, a number of errors can arise with
individual URLs. They can move, with or without leaving a
forwarding pointer. The server for a URL can be deactivated
or renamed. They may disallow retrieval by "robots,"
meaning that any program that follows the robot exclusion
protocol in "A standard for robot exclusion,
http://web.nexor.co.uk/mak/doc/robots/norobots.html" will not
retrieve them. Since the cost of retrieving modification dates
is small in comparison to the cost of retrieving robots.txt
(part of the exclusion protocol), it may well be appropriate
to ignore the robot exclusion protocol for this task, or to
check robots.txt only occasionally on each host. Observing
the protocol will still be advisable for hosts on which many
URLs are checked, especially if the pages' contents are
retrieved each time.

Finally, automatic detection of modifications based on
information such as modification date and checksum can
lead to the generation of "junk mail" as "noisy"
modifications trigger change notifications. For instance, pages
that report the number of times they have been accessed, or
embed the current time, will look different every time they
are retrieved.

W3newer attempts to address these issues by the following
steps:

If a URL is inaccessible to robots, that fact is cached so
the page is not accessed again unless a special flag is set
when the script is invoked.

Another flag can tell w3newer to treat error conditions as
a successful check as far as the URL's timestamp goes.
For instance, if w3newer runs daily and checks a
particular URL every four days, normally an error
accessing the page on Monday will cause it to be
checked again on Tuesday. With this flag, it would be
checked again on Friday. In general, it seems that errors
are likely to be transient, and checking the next time
w3newer is run would be reasonable.

When a URL is inaccessible, an error message appears in
the status report, so the user can take action to remove
a URL that no longer exists or repeatedly hits errors.

In addition, w3newer could be modified to keep a running
counter of the number of times an error is encountered for
a particular URL, or to skip subsequent URLs for a host if
a host or network error (such as "timeout" or "network
unreachable") has already occurred. Addressing the problem
of "noisy" modifications will require heuristics to examine
the differences at a semantic level.

In addition to providing a mechanism for determining
when W3 pages have been modified, there must be a way to
access multiple versions of a page for the purposes of
comparison.

There are three possible approaches for providing
versioning of W3 pages: making each content provider keep a
history of all versions, making each user keep this history, or
storing the version histories on an external server.

Server-side Support

Each server could store a history of its pages and provide
a mechanism to use that history to produce marked-up pages
that highlight changes. This method requires arbitrary
content providers to provide versioning and differencing, so it
is not practical, although it is desirable to support this
feature when the content provider is willing.

Client-side Support

Each user could run a program that would store items in
the hotlist locally, and run htmldiff against a locally saved
copy. This method requires that every page of interest be
saved by every user, which is unattractive as the number of
pages in the average user's hotlist increases, and it also
requires the ability to run htmldiff on every platform that
runs a W3 browser. Storing the pages referenced by the
hotlist may not be too unreasonable, since programs like
Netscape may cache pages locally anyway. There are other
external tools such as warmlist, as shown in "Warmlist,
http://glimpse.cs.arizona.edu:1994/paul/warmlist/", that
provide this functionality.

External Service

The approach is to run a service that is separate from both
the content provider and the client. Pages can be registered
with the service via an HTML form, and differences can be
retrieved in the same fashion. Once a page is stored with the
service, subsequent requests to remember the state of the
page result in an RCS "check-in" operation that saves only
the differences between the page and its previously
checked-in version. Thus, except for pages that change in many
respects at once, the storage overhead is minimal beyond the
need to save a copy of the page in the first place.

Drawbacks to the "external service" approach are that the
service must remember the state of every page that anyone
who uses the service has indicated an interest in and must
know which user has seen which version of each page. The
first issue is primarily one of resource allocation, and is not
expected to be a significant issue unless the service is used
by a great many clients on a number of large pages. The
second issue is addressed by using RCS's support for
datestamps and requesting a page as it existed at a particular
time. Alternatively, a version number could be retained for
each <user, URL> combination.

Relative links become a problem when a page is moved
away from the machine that originally provided it. If the
source were passed along unmodified, then the W3 browser
would consider links to be relative to the CGI directory
containing the snapshot script. HTML supports a BASE
directive that makes relative links relative to a different
URL, which mostly addresses this problem; however,
Netscape 1.1N treats internal links within such a document
to be relative to the new BASE as well, which can cause the
browser to jump between the htmldiff output and the original
document unexpectedly.

The snapshot facility must address four important issues:
use of CGI, synchronization, resource utilization, and
security/privacy.

CGI is a problem because there is no way for snapshot to
interact with the user and the user's browser, other than by
sending HTML output. When a CGI script is invoked, httpd
sets up a default timeout, and if the script does not generate
output for a full timeout interval, httpd will return an error
to the browser. This was a problem for snapshot because the
script might have to retrieve a page over the Internet and
then do a time-consuming comparison against an archived
version. The server does not tell snapshot what a reasonable
timeout interval might be for any subsequent retrievals;
instead this is hard-coded into the script. In order to keep the
HTTP connection alive, snapshot forks a child process that
generates one space character (ignored by the W3 browser)
every several seconds while the parent is retrieving a page
or executing htmldiff.

Synchronization between simultaneous users of the
facility is complicated by the use of multiple files for
bookkeeping. The system must synchronize access to the RCS
repository, the locally cached copy of the HTML document,
and the control files that record which version of each page
a user has seen. Currently this is done by using UNIX file
locking on both a per-URL lock file and the per-user control
file. Ideally the locks could be queued such that if multiple
users request the same page simultaneously, the second
snapshot process would just wait for the page and then
return, rather than repeating the work. This is not so
important for making snapshots, in which case a proxy-caching
server can respond to the second request quickly and RCS
can easily determine that nothing has changed, but there is
no reason to run htmldiff twice on the same data.

The latter point relates to the general issue of resource
utilization. Snapshot has the potential to use large amounts
of both processing and disk space. The need to execute
htmldiff on the server can result in high processor loads if
the facility is heavily used. These loads can be alleviated by
caching the output of htmldiff for a while, so many users
who have seen version N and N+1 of a page could retrieve
htmldiff(page_N, page_N+1) with a single invocation of
htmldiff. The facility could also impose a limit on the
number of simultaneous users, or replicate itself among
multiple computers, as many W3 services do.

Disk space is potentially a problem if the repository can
grow without bound and with no cost to its users. In fact,
before a service like this could be placed on the Internet, it
would have to authenticate each user and limit the user to a
fixed number of URLs and/or disk blocks. Most likely, one
would use an Internet commerce facility to charge a fee in
exchange for permission to store a collection of URLs: this
fee could easily offset the cost of the storage medium since
it would also be paying for the differencing service.

Lastly, security and privacy are important. Because the
CGI scripts run with minimal privileges, from an account to
which many people have access, the data in the repository is
vulnerable to any CGI script and any user with access to the
CGI area. Data in this repository can be browsed, altered, or
deleted. In order to use the facility one must give an
identifier (currently one's email address, which anyone can
specify) that is used subsequently to compare version
numbers. Browsing the repository can therefore indicate which
user has an interest in which page, how often the user has
saved a new checkpoint, and so on.

By moving to an authenticated system on a secure
machine, one could break some of these connections and
obscure individuals' activities while providing better
security. The repository would associate impersonal account
identifiers with a set of URLs and version numbers, and
passwords would be needed to access one of these accounts.
Whoever administers this facility, however, will still have
information about which user accesses which pages, unless
the account creation can be done anonymously.

So far, only a small fraction of pages on the W3 contain
information that allows users to ascertain how the pages
have changed; examples include icons that highlight recent
additions, a link to a "changelog", or a special "what's new"
page. As was mentioned in the introduction, these
approaches suffer from deficiencies. They are intended to be
viewed by all users, but users will visit the pages at different
intervals and have different ideas of "what's new". In
addition, the maintainer must explicitly generate the list of
recent changes, usually by manually marking up the HTML.

Automatic comparison of HTML pages and generation of
marked-up pages frees the HTML provider from having to
determine what's new and create new or modified HTML
pages to point to the differences. There are many ways to
compare documents and many ways to present the results.

HTML separates content (raw text) from markups. While
many markups (such as <P>, <I>, and <HR>) simply change
the formatting and presentation of the raw text, certain
markups such as images (<IMG src= . . . >) and hypertext
references (<A href= . . . >) are "content-defining."
Whitespace in a document does not provide any content
(except perhaps inside a <PRE>), and should not impact
comparison.

At one extreme, one can view an HTML document as
merely a sequence of words and "content-defining"
markups. Markups that are not "content-defining" as well as
whitespace are ignored for the purposes of comparison. The
fact that the text inside <P> . . . </P> is logically grouped
together as a paragraph is lost. As a result, if one took the
text of a paragraph comprised of four sentences and turned
it into a list (<UL>) of four sentences (each starting with
<LI>), no difference would be flagged because the content
matches exactly.

At the other extreme, one can view HTML as a
hierarchical document and compare the parse tree or abstract
syntax tree representations of the documents, using sub-tree
equality (or some weaker measure) as a basis for
comparison. In this case, a subtree representing a paragraph
(<P> . . . </P>) might be incomparable with a subtree
representing a list (<UL> . . . </UL>). The example of
replacing a paragraph with a list would be flagged as both a
content and format change.

An HTML document is viewed as a sequence of sentences
and "sentence-breaking" markups (such as <P>, <HR>,
<LI>, or <H1>) where a "sentence" is a sequence of words
and certain (non-sentence-breaking) markups (such as <B>
or <A>). A "sentence" contains at most one English
sentence, but may be a fragment of an English sentence. All
markups are represented and are compared, regardless of
whether or not those markups are "content-defining." In the
paragraph-to-list example, the comparison would show no
change to content, but a change to the formatting.
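
The sentence-oriented view described above can be illustrated with a small lexical pass. This is only a sketch under assumptions, not the htmldiff source: the exact set of sentence-breaking markups beyond <P>, <HR>, <LI>, and <H1> is a guess, a sentence is assumed to end at a word that finishes with ".", "!", or "?", and markups are upper-cased whole.

```python
import re

# Assumption: only <P>, <HR>, <LI>, <H1> are named in the text; the
# remaining entries are illustrative additions.
SENTENCE_BREAKING = {"P", "HR", "LI", "UL", "OL",
                     "H1", "H2", "H3", "H4", "H5", "H6"}


def tag_name(tag):
    """Return the upper-cased markup name of '<b>', '</B>', '<A href=..>'."""
    match = re.match(r"</?\s*([A-Za-z0-9]+)", tag)
    return match.group(1).upper() if match else ""


def tokenize(html):
    """Split HTML into ('break', markup) and ('sentence', items) tokens."""
    tokens, sentence = [], []

    def flush():
        if sentence:
            tokens.append(("sentence", tuple(sentence)))
            sentence.clear()

    # re.split with a capturing group keeps the markups in the result.
    for piece in re.split(r"(<[^>]*>)", html):
        if piece.startswith("<"):
            if tag_name(piece) in SENTENCE_BREAKING:
                flush()
                tokens.append(("break", piece.upper()))
            else:
                sentence.append(piece.upper())  # e.g. <B>, <A ...>
        else:
            for word in piece.split():          # whitespace is ignored
                sentence.append(word)
                if word[-1] in ".!?":
                    flush()
    flush()
    return tokens


def sentences(html):
    return [items for kind, items in tokenize(html) if kind == "sentence"]


# Paragraph-to-list example: same sentences, different breaking markups.
print(sentences("<P>One. Two.</P>") == sentences("<UL><LI>One.<LI>Two.</UL>"))
```

Because the sentence tokens of the two inputs are identical while the breaking markups differ, a comparison over these tokens reports a formatting change without a content change, as in the paragraph-to-list example.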

Hirschberg's solution is applied to the longest common
subsequence (LCS) problem as shown in "D. S. Hirschberg.
A linear space algorithm for computing maximal common
subsequences. Communications of the ACM, 18(6):341-343,
June 1975" and in "D. S. Hirschberg. Algorithms for the
longest common subsequence problem. Journal of the ACM,
24(4):664-675, October 1977" (with several speed
optimizations) to compare HTML documents. This is the
well-known comparison algorithm used by the Unix diff
utility in "J. W. Hunt and M. D. McIlroy. An algorithm for
differential file comparison. Technical Report Computing
Science TR#41, Bell Laboratories, Murray Hill, N.J., 1975".
The LCS problem is to find a (not necessarily contiguous)
common subsequence of two sequences of tokens that has
the longest length (or greatest weight). Tokens not in the
LCS represent changes. In Unix diff a token is a textual line
and each line has weight equal to 1. In htmldiff a token is
either a sentence-breaking markup or a sentence, which
consists of a sequence of words and non-sentence-breaking
markups. Note that the definition of sentence is not
recursive; sentences cannot contain sentences. A simple lexical
analysis of an HTML document creates the token sequence
and converts the case of the markup name and associated
(variable,value) pairs to upper-case; parsing is not required.
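
For readers unfamiliar with the linear-space technique cited above, the following is a minimal sketch of Hirschberg's divide-and-conquer LCS. It is not the htmldiff implementation: tokens are compared by plain equality with unit weight, and none of the speed optimizations mentioned in the text are included.

```python
def lcs_row(a, b):
    """Last row of the LCS-length table for a versus every prefix of b.

    Only two rows are kept, so the space used is linear in len(b).
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b):
            if x == y:
                cur.append(prev[j] + 1)
            else:
                cur.append(max(prev[j + 1], cur[j]))
        prev = cur
    return prev


def hirschberg(a, b):
    """Return one longest common subsequence of a and b as a list."""
    if not a or not b:
        return []
    if len(a) == 1:
        return [a[0]] if a[0] in b else []
    mid = len(a) // 2
    # LCS lengths of the first half of a against prefixes of b ...
    left = lcs_row(a[:mid], b)
    # ... and of the second half of a against suffixes of b.
    right = lcs_row(a[mid:][::-1], b[::-1])
    n = len(b)
    # Choose the split of b that maximizes the combined length.
    k = max(range(n + 1), key=lambda j: left[j] + right[n - j])
    return hirschberg(a[:mid], b[:k]) + hirschberg(a[mid:], b[k:])


old = ["<P>", "one", "two", "<HR>", "three"]
new = ["<P>", "one", "<HR>", "three", "four"]
common = hirschberg(old, new)
print(common)
```

Tokens of the old sequence that are absent from the returned subsequence correspond to deletions, and tokens of the new sequence that are absent correspond to insertions.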

It is now described how the weighted LCS algorithm 
compares two tokens and computes a non-negative weight 
reflecting the degree to which they match (a weight of 0 
denotes no match). Sentence-breaking markups can only 
match sentence-breaking markups. They must be identical 
(modulo whitespace, case, and reordering of (variable,value) 
pairs) in order to match (see section 4.3 for a discussion of 
the ramifications of this). A match has weight equal to 1. 
Sentences can match only sentences, but sentences need not 
be identical to match one another. Two steps are used to 
determine whether or not two sentences match. The first step 
uses sentence length as a comparison metric. Sentence 
length is defined to be the number of words and "content- 
defining" markups such as <IMG> or <A> in a sentence. 
Markups such as <B> or <I> are not counted. If the lengths 
of two sentences are not "sufficiently close," then they do not 
match. Otherwise, the second step computes the LCS of the 
two sentences (where words matching exactly against words 
are assigned weight 1, and markups match exactly against 
markups, as before). Let W be the number of words and
content-defining markups in the LCS of the two sentences
and let L be the sum of the lengths of the two sentences. If
the percentage (2*W)/L is sufficiently large, then the
sentences match with weight W. Otherwise, they do not match.
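
The two-step sentence test just described can be sketched as follows. The text does not say what "sufficiently close" or "sufficiently large" mean, so the two numeric parameters below are placeholders chosen for illustration, and the helper names are hypothetical.

```python
import re

CONTENT_DEFINING = {"IMG", "A"}  # examples named in the text


def is_content(item):
    """True for words and for opening content-defining markups."""
    if not item.startswith("<"):
        return True
    if item.startswith("</"):
        return False
    match = re.match(r"<\s*([A-Za-z0-9]+)", item)
    return bool(match) and match.group(1).upper() in CONTENT_DEFINING


def lcs(a, b):
    """Return one longest common subsequence of a and b (quadratic table)."""
    n, m = len(a), len(b)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    out, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            out.append(a[i])
            i, j = i + 1, j + 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


def match_weight(s1, s2, max_length_ratio=2.0, min_fraction=0.5):
    """Return the weight W if the sentences match, otherwise 0.

    max_length_ratio and min_fraction are assumed values standing in for
    "sufficiently close" and "sufficiently large".
    """
    len1 = sum(1 for x in s1 if is_content(x))
    len2 = sum(1 for x in s2 if is_content(x))
    if len1 == 0 or len2 == 0:
        return 0
    # Step 1: the sentence lengths must be sufficiently close.
    if max(len1, len2) > max_length_ratio * min(len1, len2):
        return 0
    # Step 2: W counts words and content-defining markups in the LCS.
    w = sum(1 for x in lcs(s1, s2) if is_content(x))
    return w if (2 * w) / (len1 + len2) >= min_fraction else 0


a = ["The", "quick", "<B>", "brown", "</B>", "fox", "jumps."]
b = ["The", "quick", "<I>", "red", "</I>", "fox", "jumps."]
print(match_weight(a, b))
```

Checking lengths first avoids the quadratic LCS computation for sentence pairs that could never reach the required percentage.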

The comparison algorithm outlined above yields a map- 
ping from the tokens of the old document to the tokens of the 
new document. Tokens that have a mapping are termed 
"common"; tokens that are in the old (new) document but 
have no counterpart in the new (old) are "old" ("new").
"Old" and "new" tokens are referred to as "differences".
Below are listed and described the three basic ways to 
present the differences by creating HTML documents that 
highlight the differences with a variety of markup tech- 
niques: 
Side-by-Side 

A side-by-side presentation of the documents with com- 
mon text vertically synchronized is a very popular and 
pleasing way to display the differences between documents 
(see, for example, Unix sdiff or SGI's graphical diff tool
gdiff). Unfortunately, there is no good mechanism in place
with current HTML and browser technology that allows
such synchronization (although it might be possible to make
a document that contained a table with a document per
column in which rows of the table were used to achieve
synchronization).
Only Differences

Show only differences (old and new) and eliminate the 
common part (as done in Unix diff). This optimizes for the 
"common" case, where there is much in common between 
the documents. This is especially useful for very large
documents but can be confusing because of the loss of
surrounding common context. Another problem with this
approach is that an HTML document comprised of an
interleaving of old and new fragments might be syntactically
incorrect.
Merged-page

Create an HTML page that summarizes all of the 
common, new, and old material. This has the advantage that 
the common material is displayed just once (unlike the 
side-by-side presentation). However, incorporating two
pages into one again raises the danger of creating
syntactically or semantically incorrect HTML (consider
converting a list of items into a table, for example).

The preference is to present the differences in the
merged-page format to provide context, and to use internal
hypertext references to link the differences together in a
chain so the user can quickly jump from difference to
difference. The syntactic/semantic problem of merging is
currently dealt with by eliminating all old markups from the
merged page (note that this doesn't mean all markups in the
older document, just the ones classified as "old" by the
comparison algorithm). As a result, old hypertext references
and images do not appear in the merged page (of course, since
they were deleted they may not be accessible anyway).
However, by reversing the sense of "old" and "new" one can
create a merged page with the old markups intact and the new
deleted. A more Draconian option would be to leave out all
old material. In this case, there are no syntactic problems
given that the most recent page is syntactically correct to
begin with; the merged page is simply the most recent page
plus some markups to point to the new material. Other ways to
create a merged page are being explored.

An example of htmldiff's merged-page output appears in
FIG. 1. Markups are used to highlight old and new material
as follows. Two small arrow images are used to point to
areas in the document that have changed. A red arrow points
to old content and a green arrow points to new content. The
arrows are also internal hypertext references to one another,
linked in a chain to allow quick traversal of the differences.
A banner at the front of the document contains a link to the
first difference. Old text is displayed in "struck-out" font
using <STRIKE>, which is rarely used in HTML found on
the W3. Unfortunately, there is no ideal font for showing
"new" text. Currently <STRONG><I> is used. Ideally, it
would be desirable to color code the text or text background
to highlight old and new text, but this capability is not
provided by current browsers. Another approach would be to
choose a font that is not active at the point of the difference.
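
The merged-page markup described above can be illustrated
with a short Python sketch. It is not htmldiff itself:
Python's difflib.SequenceMatcher stands in for the comparison
algorithm, the input is assumed to be plain word tokens with
no HTML markups, and a single text marker "[*]" replaces the
red and green arrow images. The function name merged_page and
the anchor names diff1, diff2, ... are hypothetical.

```python
from difflib import SequenceMatcher
from html import escape


def merged_page(old_tokens, new_tokens):
    """Build a merged-page HTML fragment from two lists of word tokens."""
    parts = []
    diff_no = 0
    matcher = SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Common material appears once, unmarked.
            parts.append(escape(" ".join(old_tokens[i1:i2])))
            continue
        diff_no += 1
        # Each difference is a named anchor that links to the next one.
        # The last link has no target in this simplified sketch.
        parts.append(
            f'<A NAME="diff{diff_no}" HREF="#diff{diff_no + 1}">[*]</A>'
        )
        if i2 > i1:  # "old" tokens: struck out
            old_text = escape(" ".join(old_tokens[i1:i2]))
            parts.append(f"<STRIKE>{old_text}</STRIKE>")
        if j2 > j1:  # "new" tokens: strong italic
            new_text = escape(" ".join(new_tokens[j1:j2]))
            parts.append(f"<STRONG><I>{new_text}</I></STRONG>")
    if diff_no:
        banner = '<A HREF="#diff1">First difference</A>'
    else:
        banner = "No differences"
    return banner + "\n" + " ".join(parts)
```

Calling merged_page on the tokens of "the quick brown fox"
and "the quick red fox" yields a banner linking to the first
difference, followed by the common words with "brown" struck
out and "red" in strong italics.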

Note that not all changes in the documents are highlighted.
For example, new markups that are not "content-defining"
(such as <P>) are not marked up. However, markups such as
anchors are highlighted. Consider the example of changing the
URL in an anchor but not the content surrounded by
<A> . . . </A>. In this case, an arrow will point to the text
of the anchor, but the text itself will be in its original
font, signifying a change to just the URL.

Since htmldiff can parse an HTML document and rectify
certain syntactic problems, such as mismatched or missing
markups, the only real problem it is likely to encounter is a
set of changes that are so pervasive as to make the resulting
merged HTML unreadable. For instance, if every other line
were changed, then the mixture of unrelated struck-out and
emphasized text would be muddled. Experiments are under way
with methods for varying the degree to which old and new text
can be interspersed, as well as with thresholds to specify
when the changes are too numerous to display meaningfully.

Currently, htmldiff is neither "version-aware" nor
"web-aware". That is, htmldiff only compares the text of two
HTML pages. It does not compare versions of the entities
that the pages refer to, access them, or invoke itself
recursively on other referenced pages. This has a number of
consequences. The good news is that htmldiff does not incur
the overhead of pulling versions from a repository or sending
requests over the W3 for information. This cost is borne by
w3newer and snapshot. The bad news is that some differences
may be ignored. For example, if the contents of an image file
are changed but the URL of the file is not, then the URL in
the page will not be flagged as changed. Supporting such
comparison would require some sort of versioning of
referenced entities and would also require htmldiff to have
access to the version repositories. Full versioning of all
entities would allow interesting comparisons to be done, but
would dramatically increase storage requirements. A cheaper
alternative would be to store a checksum of each entity and
use the checksums to determine if something has changed. How
to perform such "smarter" comparisons efficiently is being
explored.
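
The checksum alternative can be sketched as follows. The text
does not name a checksum algorithm or a storage layout, so
SHA-256 and the in-memory dictionaries below are assumptions
chosen for illustration, and the names entity_checksum and
changed_entities are hypothetical.

```python
import hashlib


def entity_checksum(data: bytes) -> str:
    """Checksum of a referenced entity (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


def changed_entities(stored, current):
    """Return the URLs whose content no longer matches the stored checksum.

    stored maps URL -> checksum recorded when the page was last saved;
    current maps URL -> the bytes of the entity as fetched now.
    Entities with no stored checksum are reported as changed.
    """
    return sorted(
        url
        for url, data in current.items()
        if stored.get(url) != entity_checksum(data)
    )
```

Only one short digest per entity is kept, so an image whose
contents change under an unchanged URL can be flagged without
archiving every version of the image.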

There are two entry points to NO HANDS, one through
w3newer and one through snapshot. Currently, w3newer is
invoked directly by the user, probably by a crontab entry,
and generates an HTML document indicating which pages
have changed. If specified, w3newer will associate three
links with each document in the hotlist:
Remember 

Send the URL to the snapshot facility, to save a copy of 
the page. Though the page is retrieved, the RCS ci command 
ensures that it is not saved if it is unchanged from the 
previous time it was stored away. 
Diff 

Have the snapshot facility invoke htmldiff to display the 
changes in a page since it was last saved away by the user. 
History 

Have snapshot display a full log of versions of this page, 
with the ability to run htmldiff on any pair of versions or to 
view a particular version directly. (See FIG. 2.) 

Thus, each page that is reported as "new" can immediately
be passed to htmldiff, and any page in the list can be
"remembered" for future use. An example of w3newer's 
output appears in FIG. 3. 

A user may also choose to enter snapshot directly to 
check-in pages, or view the current page or the version 
history. FIG. 4 shows the interface to NO HANDS through
snapshot. If the user selects the history link, the page shown
in FIG. 2 is presented. Finally, selecting two pages to 
compare invokes htmldiff as in FIG. 1. 

One disadvantage of the current approach is that there is 
no direct interaction between w3newer, snapshot, and the 
W3 browser. Viewing a page with htmldiff does not cause the
browser to record that the page has just been seen; instead, 
the browser records the URL that was used to invoke 
htmldiff in the first place. Subsequently, w3newer uses the 
obsolete datestamp from the browser and continues to report 
that the page has been modified more recently than the 
browser has seen it. As a result, the user must view a page 
directly as well as via htmldiff in order to both remove it 
from the list of modified pages and see the actual differ- 
ences. 
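
The stale-datestamp effect can be made concrete with a small
Python sketch. How w3newer actually obtains modification
dates and reads the browser history is not described here, so
both are modeled as plain dictionaries; the function name
pages_to_report and the example URLs are hypothetical.

```python
from datetime import datetime


def pages_to_report(last_modified, last_seen):
    """Return URLs modified more recently than the browser has seen them.

    Both arguments map URL -> datetime. A page the browser has never
    seen is always reported.
    """
    return sorted(
        url
        for url, modified in last_modified.items()
        if url not in last_seen or modified > last_seen[url]
    )


# The page changed on 10 January. The user then viewed it only through
# htmldiff, so the browser recorded the htmldiff URL, not the page URL.
last_modified = {"http://example.com/page.html": datetime(1996, 1, 10)}
last_seen = {
    "http://example.com/page.html": datetime(1996, 1, 5),
    "http://example.com/cgi-bin/htmldiff?page.html": datetime(1996, 1, 11),
}
still_reported = pages_to_report(last_modified, last_seen)
```

In this example still_reported continues to contain the page
URL, because the only datestamp recorded for that URL
predates the modification.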

This section describes some possible extensions to the
work already presented. Section 6.1 discusses an interface
between RCS and htmldiff that is already implemented,
while Sections 6.2 and 6.3 present unimplemented extensions
to integrate tracking modifications into the server and
to invoke scripts via the HTTP POST protocol.

The tools described above do not require any changes to
arbitrary servers or clients on the W3. Existing GET and
POST protocols are used to communicate with specific
servers that save versions of documents and provide
marked-up versions showing how they have changed.
However, if a server runs htmldiff and some perl scripts, it
can provide a direct version-control interface and avoid the
need to store copies of its HTML documents elsewhere.

The perl scripts so far written provide an interface to RCS
as shown in "W. Tichy. RCS: a system for version control.
Software-Practice & Experience. 15(7): 637-654, July
1985". One script (/cgi-bin/rlog) converts the output of
rlog into HTML, showing the user a history of the document
with links to view any specific version or to see the
differences between two versions. Another script
(/cgi-bin/co) displays a version of a document under RCS
control, while still another (/cgi-bin/rcsdiff) displays the
differences. If the file's name ends in html then htmldiff
is used to display the differences, rather than the rcsdiff
program.

As an example, one might set up a Last-Modified field at
the bottom of an HTML document to be a link to the rlog
script, with the document name specified as a parameter.
After clicking on this unobtrusive field, the user would be
able to see the history of the document.

Currently, w3newer runs on the user's machine, so multiple
instantiations of the script may perform the same work.
Although it runs a related daemon on the same machine as
an AT&T-wide proxy-caching server, which returns
information about pages that are currently cached on the
server and may eliminate some accesses over the Internet,
there is insufficient locality in that cache for it to
eliminate a significant fraction of requests.

Alternatively, w3newer could be run on the set of pages
that have been saved by the snapshot daemon. Regardless of
how many users have registered an interest in a page, it need
only be checked once: if changed, the new version could be
saved automatically. Then a user could request a list of all
pages that have been saved away, and get an indication of
which pages have changed since they were saved by the
user.

Adding this functionality would be useful, since it would
offer economies of scale. It would have the disadvantage of
being decoupled from a given user's W3 browser history;
i.e., if a user views a page directly, the snapshot facility
would have no indication of this and might present the page
as having been modified.

Because NO HANDS can handle arbitrary URLs, it can
interact with CGI scripts that use the GET protocol by
passing arguments to the script as part of the URL. However,
services that use POST cannot be accessed, because the
input to the services is not stored.

Both w3newer and snapshot would have to be modified to
support the POST protocol, in order to invoke a service and
see if the result has changed, and then to store away the
result and display the changes if it has. The interface to NO
HANDS to support POST is unclear, however. A user could
manually save the source to an HTML form and change the
URL the form invokes to be something provided by NO
HANDS. It, in turn, would have to make a copy of its input
to pass along to the actual service. The result would be an
HTTP equivalent of a UNIX pipe, interposing an extra
service between the browser and the service the user is
trying to invoke.

Instead, the browser could be modified to have better
support for forms:

It should be able to save the filled-out version of a form in
a bookmark file, so the user could jump directly to the
output of a CGI script.

It should be able to pass a form directly to NO HANDS,
along with the URL specified in the FORM tag, so that
the output could be stored under RCS.

NO HANDS combines notification, archiving, and
differencing of W3 pages into a single cohesive tool. It
achieves economies of scale by avoiding unnecessary HTTP
accesses, saving pages at most once each time they are
modified (regardless of the number of users who track them),
and using RCS as the underlying versioning system. Automatic
generation of differences within the HTML framework provides
users with the ability to see both insertions and deletions
in a convenient fashion.

In the general setting of the W3 and document retrieval,
NO HANDS benefits two communities: users of the W3 no
longer have to browse to [illegible] pages of interest that
have changed; HTML providers no longer have to create
suitably marked-up pages to show "what's new". While such
automation is clearly helpful in this general context, it is
expected that NO HANDS will be a critical part of more
focused uses of the W3, especially in areas involving
collaborative and distributed work.

Several issues still need to be addressed. In particular,
many of the complications of NO HANDS could be avoided
by better integration with W3 browsers and servers. For
instance, viewing the difference between an older version of
a page and its current version should update the browser's
notion of when the page was last visited. Finally, the
increasing availability of distributed, hierarchical HTTP
repositories such as shown in "C. Mic Bowman et al.
Harvest: A scalable, customizable discovery and access
system. Technical Report CU-CS-732-94, Dept. of Computer
Science, University of Colorado-Boulder, March 1995", will
be both an opportunity and a challenge for scalable
notification mechanisms and version archives.

Numerous substitutions and modifications can be undertaken
without departing from the true spirit and scope of the
invention. What is desired to be secured by Letters Patent is
the invention as defined in the following claims.

We claim:

1. A method for monitoring changes to a document stored
on the World Wide Web, comprising the steps of:

copying an original document selected by a user from the
World Wide Web to create a copied document on a
server separate from the World Wide Web and under
independent control;

monitoring for changes in the original document;

archiving, on the separate server, the changes in the
original document, as detected during such monitoring;

storing various versions of the original document on the
separate server;

presenting to the user, in response to a request to access
the original document, a current version of the original
document as archived, and an option to compare
selected versions, as archived.

2. A method according to claim 1 further comprising the
step of:

presenting to the user, an option to view a history of
different versions of the original document.
3. A method according to claim 1 further comprising the
step of: 

recording the times when the user accesses each docu- 
ment. 

4. A method according to claim 1, further comprising the
step of:

comparing the current version of the original document as 
archived with the copied document. 

5. A method according to claim 1 further comprising the 
step of: 

notifying the user of the changes in the original selected
document since the user last accessed the document.

6. A method according to claim 5 wherein the user is 
notified upon a specific request by the user. 
7. A method according to claim 5 wherein the user is
notified simply by the user's access to the selected docu- 
ment. 

8. A method according to claim 5 wherein the user is 
notified by electronic mail message. 

9. A method according to claim 4 wherein the documents 
that are compared for any changes are determined by 
default. 

10. A method according to claim 4 wherein the documents 
that are compared for any changes are specified by the user. 